Received March 4, 1999
With the current and ever-growing offering of [...] virtual libraries accessible [...] palette of organic reactions [...] representative subsets for experimentation [...] dramatically increased in size. Yet [...] has been some [...] combinatorial [...] space, should be investigated prior to embarking [...] considerations. Several classes of descriptors are studied, including [...] fingerprints (ISIS and Daylight) and physicochemical descriptors.



INTRODUCTION 

products. In this context, we define "best" [...]

[...] techniques involves [...] methods. A second class of techniques involves the [...] of clustering [...] and dissimilarity-based methods [...].

[...] this case, the combinatorial array may be maintained by the [...] been previously [...] the problem. A reagent array of 50 × 150 × 200 × 350 for a four-substituent system would generate 525 million products.
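The product count quoted above follows directly from multiplying the number of reagents at each substituent position. A minimal check, using only the array dimensions given in the text (the variable names are illustrative):

```python
from math import prod

# Reagent counts per substituent position, as quoted in the text.
reagent_counts = [50, 150, 200, 350]

# One product for every combination of one reagent per position.
n_products = prod(reagent_counts)
print(f"{n_products:,} products")  # 525,000,000 products
```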

Problems of this complexity [...]

Figure 1.



adequately represent the entire library. Combinatorial subsets differ from simple product subsets in that they correspond to an array of reagents, so that a given product cannot be obtained without generating the other cross products. A number of diversity metrics have been described by Hassan et al. to characterize noncombinatorial selections.15 The measurement of diversity for combinatorial subsets poses different problems. Indeed, the combinatorial constraint usually imposes that a number of compounds be similar to each other in the resulting subset. This characteristic does not reflect the quality of the subset but simply results from its combinatorial nature. Although most distance-based diversity functions would not be well behaved under these conditions, pairwise dissimilarities have been used in combinatorial optimization procedures. However, cell-based approaches can often provide faster measurements of space coverage. A number of cell-based methods have been introduced herein to characterize combinatorial subsets.
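The distinction between a combinatorial subset and a simple product subset can be made concrete with a small check. The sketch below assumes each product is represented as a tuple of reagent labels, one per substituent position; the function name and the labels are hypothetical and serve only as illustration.

```python
from itertools import product


def is_combinatorial(products):
    """Return True if `products` is exactly the cross product of the
    reagents that appear at each substituent position."""
    products = set(products)
    if not products:
        return False
    # Reagents used at each position (column-wise view of the tuples).
    reagents_per_position = [set(column) for column in zip(*products)]
    return products == set(product(*reagents_per_position))


# Hypothetical reagent labels, for illustration only.
full_array = list(product(["a1", "a2"], ["b1", "b2", "b3"]))
cherry_picked = [("a1", "b1"), ("a2", "b2")]

print(is_combinatorial(full_array))     # True
print(is_combinatorial(cherry_picked))  # False
```

The second selection fails because its reagents imply four cross products, of which only two were chosen.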

Descriptors. A number of descriptors have been investigated in their ability to characterize molecular diversity, ranging from MDL ISIS and Daylight fingerprints to topological indices. Investigations from Gillet et al. revealed that, in the case of Daylight fingerprints, there are significant advantages to product-based approaches compared to simple reagent-based selections. As descriptors vary considerably in encoding molecular structure, there could be significant differences in the way diversity is inferred from reagent space to product space.

Measurement of Diversity and Space Coverage. Diversity and space coverage can be evaluated using a number of methods.16 Cell-based methods, implemented as diversity metrics, evaluate how much of the space occupied by the entire library is covered by the subset. The cell-based fraction metric attempts to select one compound from each cell in order to cover as many cells as possible. However, due to the combinatorial constraint, the objective to cover all occupied cells can seldom be achieved. The cell-based χ² and entropy metrics attempt to level out the distribution to provide an even allocation of compounds to cells. The cell-based density metric attempts to select more than one compound from the most populated cells, in order to respect the level of occupancy of each cell. These metrics were used and compared as target functions in the combinatorial optimization process.

cell-based fraction: $F = \dfrac{\text{number of cells occupied by subset}}{\text{number of occupied cells}}$

cell-based Chi²: $\chi^2 = \sum_i (N_i - N_{\mathrm{avg}})^2$

cell-based entropy: $S = -\sum_i N_i \log(N_i)$

cell-based density: $D = -\sum_i N_i \log(N_i / M_i)$

where

$N_i$ = number of compounds in cell $i$ for subset

$M_i$ = number of compounds in cell $i$ for complete library

$N_{\mathrm{avg}}$ = average number of compounds per cell expected for subset

$\sum_i$ = sum over cells occupied by subset
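The four cell-based metrics can be computed from cell occupancy counts alone. The following is a minimal sketch under stated assumptions, not the authors' implementation: each compound is assumed to carry a precomputed cell index, the subset is drawn from the library, the logarithm is natural, the density term is taken as the occupancy ratio of subset to library in each cell, and the expected average is taken as the subset size divided by the number of cells occupied by the library. The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def cell_metrics(subset_cells, library_cells):
    """Compute (F, chi2, S, D) from the cell index of every compound
    in the subset and in the complete library."""
    n = Counter(subset_cells)    # N_i: subset compounds per cell
    m = Counter(library_cells)   # M_i: library compounds per cell
    # Assumption: expected average = subset size / occupied library cells.
    n_avg = len(subset_cells) / len(m)

    fraction = len(n) / len(m)
    chi2 = sum((count - n_avg) ** 2 for count in n.values())
    entropy = -sum(count * log(count) for count in n.values())
    # Assumption: density compares subset and library occupancy per cell.
    density = -sum(count * log(count / m[cell]) for cell, count in n.items())
    return fraction, chi2, entropy, density


# Toy data: 7 library compounds spread over 4 cells, 4 of them selected.
library_cells = [0, 0, 0, 1, 1, 2, 3]
subset_cells = [0, 0, 1, 2]
F, chi2, S, D = cell_metrics(subset_cells, library_cells)
print(F, chi2)  # 0.75 1.0
```

All sums run only over cells occupied by the subset, so empty cells never reach the logarithm.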

Subset Evaluation. Subset evaluation is required as an objective measure of the quality of the subset. The quality of a subset is defined as its ability to represent the entire library. Distance-based functions may be used toward this end to evaluate how much coverage of the entire library may be achieved by a given set of compounds.

METHODS 

Test Libraries. We used the combinatorial library described in ref 24 as a first example (Figure 1a). The list of amino acids was reduced to eliminate enantiomeric reagents since the descriptors used in the subsequent analysis are not sensitive to stereochemistry. The resulting complete library consisted of a 16 × 18 × 20 array for R1 × R2 × R3 for a total of 5760 products. A library of tripeptides was used as a second example (Figure 1b). The complete library consisted of a 20 × 20 × 20 array for R1 × R2 × R3 for a total of 8000 products. In both cases, we investigated subsets of 672 compounds corresponding to a 7 × 8 × 12 array.
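The library and subset sizes quoted above can be verified from the array dimensions, which also shows what share of each library a 7 × 8 × 12 subset represents. The dictionary keys are illustrative labels only.

```python
from math import prod

# Array dimensions (R1, R2, R3) as given in the text.
libraries = {"benzodiazepine": (16, 18, 20), "tripeptide": (20, 20, 20)}
subset_shape = (7, 8, 12)

subset_size = prod(subset_shape)
for name, shape in libraries.items():
    total = prod(shape)
    print(f"{name}: {total} products, subset covers {subset_size / total:.1%}")
# benzodiazepine: 5760 products, subset covers 11.7%
# tripeptide: 8000 products, subset covers 8.4%
```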

Descriptors. This analysis involves three sets of descriptors: ISIS keys (960 bits), Daylight fingerprints (1024 bits), and a set of 43 physicochemical descriptors. The set of 43 descriptors involved information content indices, structural descriptors, thermodynamic descriptors, and topological indices (Kier and Hall, Balaban, Wiener, and Zagreb indices). These descriptors have been previously used in the characterization of molecules and fragments.15-25

Reagent-Based Selections. Our selection of reagents proceeds from a hierarchical cluster analysis (HCA) performed on each reagent list. The same set of descriptors was used for characterization of both reagents and products. In the case of physicochemical descriptors, principal component analysis (PCA) was first performed on the original set of descriptors. This procedure allowed for weighting of the principal components to compensate for possible correlations between descriptors.



Design of Combinatorial Library Subsets

[...] five MDS coordinates only explained 40-[...] of the information contained in the [...] fingerprints. An alternate approach involving clustering [...] in order to provide a one-dimensional representation of [...]. Cluster analysis provides [...] membership information by assigning a cluster identification (ID) [...]. Following on this, [...] be described as the equivalent of an occupied cell. In our case, the number of desired clusters k [...] from fingerprints [...].

[...] dealing with physicochemical descriptors [...] principal components [...] of the original variance and were deemed sufficient to carry out the remainder of the analysis. The three principal components were used as the new coordinate system for [...] optimization and [...].

The combinatorial optimization uses a Monte Carlo procedure starting from a random 7 × 8 × 12 [...]. The starting selection of reagents is then optimized using an annealing procedure involving 50 000 steps at T = 1000, 300, 100, 30, and 10, with a minimum of 5000 idle steps before the optimization is terminated. Given that the [...] were performed corresponding to different seeds.
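The general shape of such a Monte Carlo annealing over reagent selections can be sketched as follows. This is an illustration under explicit assumptions rather than the procedure actually used: the move (swapping one selected reagent for an unselected one at a random position), the Metropolis acceptance rule, and the objective (number of distinct cells covered) are all assumed, the idle-step termination is omitted, and `anneal_subset` and `toy_cell` are hypothetical names.

```python
import math
import random
from itertools import product


def anneal_subset(pool_sizes, subset_shape, cell_of, temperatures,
                  steps_per_temperature, seed=0):
    """Select subset_shape[i] reagents out of pool_sizes[i] at each position,
    maximizing the number of distinct cells covered by the cross products."""
    rng = random.Random(seed)
    # Random starting selection of reagents at every position.
    chosen = [rng.sample(range(n), k) for n, k in zip(pool_sizes, subset_shape)]

    def score(selection):
        return len({cell_of(p) for p in product(*selection)})

    current = score(chosen)
    best, best_score = [list(c) for c in chosen], current
    for temperature in temperatures:
        for _ in range(steps_per_temperature):
            # Assumed move: swap one selected reagent for an unselected one.
            pos = rng.randrange(len(chosen))
            unused = [r for r in range(pool_sizes[pos]) if r not in chosen[pos]]
            if not unused:
                continue
            slot = rng.randrange(len(chosen[pos]))
            old = chosen[pos][slot]
            chosen[pos][slot] = rng.choice(unused)
            new = score(chosen)
            # Assumed Metropolis rule: keep improvements, sometimes keep losses.
            if new >= current or rng.random() < math.exp((new - current) / temperature):
                current = new
                if current > best_score:
                    best, best_score = [list(c) for c in chosen], current
            else:
                chosen[pos][slot] = old  # reject the move
    return best, best_score


def toy_cell(p):
    """Hypothetical cell assignment, used for illustration only."""
    return sum(p) % 7


schedule = (1.0, 0.3, 0.1)
# Zero optimization steps reproduces the random starting selection for a seed.
start, start_score = anneal_subset((6, 6, 6), (2, 3, 2), toy_cell, schedule, 0, seed=1)
best, best_score = anneal_subset((6, 6, 6), (2, 3, 2), toy_cell, schedule, 200, seed=1)
```

Running with zero steps and the same seed recovers the starting point of an optimization run, which allows the improvement from each trial to be measured against its own random start.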

Random Combinatorial Selections. Random combinatorial selections were obtained and analyzed to provide a [...]. A set of 100 random selections was [...] process as described but with 0 steps for optimization. The seeds used to obtain the random selections were the same as those used in the optimization process. In this fashion, we could reproduce the original starting point and assess the [...] provided by each optimization trial. Indeed, [...] that a better starting point provides [...] combinatorial optimization. Since the converse could also apply, it is important to assess the stability of results obtained from combinatorial optimization.

Figure 2.

[...] those N - n compounds that were not selected can be represented by our selection. Thus, for every one of the N - n compounds not selected, the minimum [...] distribution of distances can be plotted in a histogram [...].

A distribution histogram skewed toward very low distance values indicates that for every one of the unselected [...] a number of occurrences for larger distance values. In this case, the selected subset may adequately represent a large proportion of the entire library but does not provide an adequate representation for some smaller set of compounds.

[...] assessment of the quality of the subset, [...] minimum distances [...] remainder of the library [...] quality of the selected subset.
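The evaluation outlined here appears to rest on the distance from each unselected compound to its nearest selected compound. A minimal sketch of that calculation follows, assuming Euclidean distance in a descriptor space and summarizing the distribution by its mean; both choices are assumptions, and the coordinates are hypothetical.

```python
from math import dist


def min_distances(selected, unselected):
    """For each unselected compound, the distance to its nearest
    selected compound (Euclidean distance is an assumption here)."""
    return [min(dist(u, s) for s in selected) for u in unselected]


# Hypothetical 2-D descriptor coordinates, for illustration only.
selected = [(0.0, 0.0), (4.0, 0.0)]
unselected = [(1.0, 0.0), (4.0, 3.0), (0.0, 2.0)]

d = min_distances(selected, unselected)
mean_d = sum(d) / len(d)
print(d, mean_d)  # [1.0, 3.0, 2.0] 2.0
```

A histogram of `d` gives the distribution discussed in the text; a smaller mean indicates that unselected compounds lie closer to the subset.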
Noncombinatorial Selections. Noncombinatorial selections [...] removal of the combinatorial constraint. This [...] provided an approximate lower bound to the D values obtained from combinatorial selections. This assumption is consistent with the arguments developed by Gillet et al.

Table 1. Benzodiazepine library: D values for each subset selection method (standard deviation in parentheses)

method                    ISIS keys       Daylight keys   physicochemical descriptors
random                    39.70 (3.84)    31.96 (6.27)    0.293 (6.6E-02)
reagent-based             27.34           22.98           0.218
product-based: cell F     27.75 (0.92)    19.12 (0.62)    0.207 (8.8E-03)
product-based: cell χ²    34.00 (0.54)    19.98 (0.62)    0.200 (1.8E-03)
product-based: cell S     32.84 (1.27)    19.09 (0.66)    0.202 (3.1E-03)
product-based: cell D     25.94 (0.29)    16.84 (0.42)    0.226 (7.1E-03)
noncombinatorial          22.06           14.33           0.193

Table 2. Tripeptide library: D values for each subset selection method (standard deviation in parentheses)

method                    ISIS keys       Daylight keys   physicochemical descriptors
random                    23.20 (6.37)    15.37 (4.21)    0.263 (4.3E-02)
reagent-based             17.07           11.72           0.214
product-based: cell F     13.77 (0.44)    7.93 (0.076)    0.180 (6.3E-03)
product-based: cell χ²    13.98 (0.38)    [...] (1.51)    0.176 (5.9E-03)
product-based: cell S     13.91 (0.44)    7.70 (0.00)     0.175 (4.2E-03)
product-based: cell D     12.44 (0.20)    8.80 (0.57)     0.169 (4.0E-03)
noncombinatorial          11.22           6.37            0.161

([...] on a single-CPU R10000 Silicon Graphics workstation), a single run was performed for each descriptor.

RESULTS 

Benzodiazepine Library. A summary of the results for the benzodiazepine library is provided in Table 1 as D values corresponding to the different subset selection techniques. Numbers in parentheses indicate the standard deviation over selections for which 100 runs were performed. Results are shown for ISIS keys (Figure 3), Daylight keys (Figure 4), and physicochemical descriptors (Figure 5). The graphs show the mean and range of D [...]. [...] descriptors, most optimization [...] substantial differences were observed [...] optimization methods provided significantly more representative subsets compared to the reagent-based selection [...].

[Figure 3. Benzodiazepine Library - ISIS Fingerprints]

Tripeptide Library. A summary of the results for the tripeptide library is provided in Table 2. Results for the tripeptide library are shown for ISIS keys (Figure 6), Daylight keys (Figure 7), and physicochemical descriptors (Figure 8). For all sets of descriptors investigated, the combinatorial optimization methods performed significantly better than simple reagent-based selections. This observation was also independent of the cell diversity metric used. As observed with the previous library, the combinatorial optimization methods provided very narrow ranges of D, indicating a very stable behavior with regard to the starting point used.

DISCUSSION 

Cluster Analysis. Cluster analysis provides valuable results that can be used for combinatorial optimization. As discussed earlier, MDS coordinates did not retain sufficient distance information from fingerprints to be used as a lower dimensionality space. Cluster IDs assigned to compounds provide a one-dimensional representation with which combinatorial optimization could be performed. The superior results obtained with product-based combinatorial optimization compared to simple reagent selection attest to the usefulness of this technique. It should be noted that since the number of desired clusters is known at the outset, fast relocation clustering may be performed, even on large data sets.

Reagent-Based versus Product-Based. In our observations, the benefit of product-based optimization compared to simple reagent selection depended very much on the descriptors used in the analysis and also on the nature of the library. In the case of the benzodiazepine library, using ISIS fingerprints, there was little or no improvement obtained from combinatorial optimization over a simple reagent selection. The choice of cell metric had only marginal effect on this outcome. With the selected set of physicochemical descriptors, the improvements were consistent but relatively small in regard to the added complexity of the product-based approach. On the other hand, when using Daylight fingerprints, the improvement was not only consistent but also substantial, highlighting the advantages of a product-based approach. This finding is consistent with the observations of Gillet et al.

In the case of the tripeptide library, there were always significant advantages obtained from combinatorial optimization regardless of the choice of descriptors. This behavior can be attributed to the degeneracy of the R group lists. In the reagent-based approach, selections of R1, R2, and R3 are made independently of each other. For example, the selection of residues for R3 does not incorporate knowledge of the selections made for R1 and R2. This results in significant overlaps between R group lists. On the other hand, the combinatorial optimization provides simultaneous analysis and optimization of R1, R2, and R3 substituents.

Another key advantage of the product-based approach is the ability to provide additional constraints on the products [...] characteristics based on whole molecules. These constraints can be managed in addition to the combinatorial constraints used in this study.

Descriptors. We observed substantial differences across the sets of descriptors. Noteworthy is the difference between ISIS and Daylight fingerprints, which displayed very different behaviors from reagent space to product space. It appears that, in the case of ISIS fingerprints, the encoding of structural elements includes few extended paths and therefore does not reach into the core [...] (Figure 9). Hence, information about each R group is possibly encoded independently of other R groups in the molecules. On the other hand, we suspect that Daylight fingerprints encode path information spanning from each R group into the core and possibly to other R groups in the molecules. In the case of a mixture of physicochemical descriptors, we expect that the behavior will depend on the nature of the descriptors used. In our case, we likely observed an average resulting from contributions from several individual behaviors. Descriptors such as low order connectivity indices span short paths and therefore should behave similarly to ISIS fingerprints. Conversely, high order connectivity indices span extended paths and therefore should behave similarly to Daylight fingerprints.

Combinatorial Optimization. The combinatorial optimization process attempts to identify a selection of reagents which provides the best coverage of product space. In our case, the process optimizes a 7 × 8 × 12 array from the complete array of 16 × 18 × 20 (benzodiazepine library) or 20 × 20 × 20 (tripeptide library) for R1 × R2 × R3. Even in the case of the relatively simple benzodiazepine library, the total number of possible 7 × 8 × 12 subsets is $C_{16}^{7} \times C_{18}^{8} \times C_{20}^{12} = 6.31 \times 10^{13}$, a formidable number. Since it would be impossible to systematically investigate every possible subset, we rely on the Monte Carlo procedure to provide a near-optimal solution. Related studies using genetic algorithms suggest that the subsets obtained with such procedures are only slightly suboptimal.
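The size of the search space follows from choosing the reagents independently at each position: 7 out of 16, 8 out of 18, and 12 out of 20 for the benzodiazepine library. A short check of the quoted figure, with illustrative variable names:

```python
from math import comb

pool = (16, 18, 20)   # reagents available at R1, R2, R3
pick = (7, 8, 12)     # reagents selected at R1, R2, R3

n_subsets = 1
for available, chosen in zip(pool, pick):
    n_subsets *= comb(available, chosen)

print(f"{n_subsets:.2e}")  # 6.31e+13
```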

In order to quantify the stability of the solution provided by the combinatorial optimization, we compared the standard deviation of D obtained from the Monte Carlo optimization procedure to that of D obtained in random samples (Tables 1 and 2). We observed that the standard deviation from the optimization runs was considerably less than was obtained in the random runs. This result confirms the reliability of the optimization process with little dependency on the starting random selection.

Limitations. We recognize that, while considering a real combinatorial library, this analysis is purely theoretical. It would be interesting to compare the performance of the subsets not purely from theoretical coverage considerations but also from experimental screening results.

Our investigations examined the dependence on several parameters but for only two libraries. It would be valuable to repeat similar experiments with different sizes and types of libraries. For example, libraries with different sizes [...] with different spacing of R groups could be investigated. We believe these factors may affect the differences observed between product-based and reagent-based [...] the dependencies on the [...] of [...].

Another factor that is likely to influence the mapping of diversity from reagent space to product space is the complexity of each reaction step involved in the elaboration of the combinatorial library. We envision that, [...] possible rearrangements depending on the nature of the reagents, then diversity will less obviously map from reagent to product. Under these conditions, there may be additional advantages gained from product-based approaches.

CONCLUSION 

Our results indicate that, in some cases, better subsets may be obtained through product-based combinatorial optimization compared to simple reagent-based considerations. The benefit gained from product-based approaches can largely depend on the type of descriptors used in the analysis. It may be ranked in the following order: Daylight fingerprints > physicochemical descriptors > ISIS fingerprints. We [...] that, for a chosen set of descriptors, a preliminary study should be performed to investigate the benefits of a product-based approach.

Other factors such as library size will influence the final approach since considerable effort may be necessary to enumerate the libraries and compute the chosen set of descriptors. The combinatorial optimization itself proved to be very fast, requiring less than 1 min of CPU time for [...] optimization using the [...].

In the case of three-dimensional descriptors such as those